Abstract
The authors reveal that much of the performance gain of word embeddings is due to certain system design choices and hyperparameters, rather than the embedding algorithms themselves. Furthermore, they show that these modifications can be transferred to traditional distributional models, yielding similar gains. A recent study by Baroni et al. (2014) shows that the new embedding methods consistently outperform the traditional methods by a non-trivial margin on many similarity-oriented tasks. However, analysis by Levy and Goldberg shows that word2vec’s SGNS is implicitly factorizing a word-context PMI matrix.
Background
Four word representation methods are considered:
- the explicit PPMI matrix
- SVD factorization of said matrix
- SGNS
- GloVe
PPMI Matrix
$PMI(w,c)=\log \frac{\hat{P}(w,c)}{\hat{P}(w)\,\hat{P}(c)}=\log\frac{\#(w,c)\cdot |D|}{\#(w)\cdot \#(c)}$
$PPMI(w,c)=\max(PMI(w,c),\,0)$
A well-known shortcoming of PMI, which persists in PPMI, is its bias towards infrequent events.
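The PPMI definition above can be sketched directly from a co-occurrence count matrix. This is an illustrative implementation, not the paper's code; the function name and the toy counts are made up:

```python
import numpy as np

def ppmi_matrix(counts):
    """Compute a PPMI matrix from a word-context co-occurrence count matrix.

    counts: 2-D array where counts[w, c] = #(w, c).
    """
    total = counts.sum()                                 # |D|
    word_counts = counts.sum(axis=1, keepdims=True)      # #(w)
    context_counts = counts.sum(axis=0, keepdims=True)   # #(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (word_counts * context_counts))
    pmi[~np.isfinite(pmi)] = 0.0   # unobserved pairs: log(0) -> no association
    return np.maximum(pmi, 0.0)    # PPMI = max(PMI, 0)
```

Note how a pair observed less often than chance (negative PMI) and a never-observed pair both collapse to 0 in PPMI.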
Transferable Hyperparameters
Adapt and apply the hyperparameters to count-based methods.
- pre-processing hyperparameters
- association metric hyperparameters
- post-processing hyperparameters
Pre-processing hyperparameters
Dynamic Context Windows (dyn)
Context words can be weighted according to their distance from the focus word.
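The two embedding toolkits weight distance differently: word2vec samples the window size uniformly, which is equivalent to weighting a context word at distance d out of window L by (L - d + 1)/L, while GloVe uses a harmonic 1/d weighting. A minimal sketch of both schemes (function names are mine):

```python
def dyn_weight(distance, window):
    """word2vec-style dynamic window: a context word at distance d
    from the focus word gets weight (window - d + 1) / window."""
    return (window - distance + 1) / window

def glove_weight(distance):
    """GloVe-style harmonic weighting: weight 1/d at distance d."""
    return 1.0 / distance
```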
Subsampling
Subsampling is a method of diluting very frequent words, akin to removing stop-words.
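In word2vec's scheme, a token whose corpus frequency f exceeds a threshold t (typically around 1e-5) is randomly discarded; one common formulation keeps each token with probability sqrt(t/f). A sketch under that assumption (the released word2vec code uses a slightly different variant of the formula):

```python
import math
import random

def keep_probability(word_freq, t=1e-5):
    """Probability of keeping a token whose relative corpus frequency is
    word_freq; frequent words are aggressively diluted."""
    return min(1.0, math.sqrt(t / word_freq))

def subsample(tokens, freqs, t=1e-5, rng=random.random):
    """Drop tokens stochastically; freqs maps word -> relative frequency."""
    return [w for w in tokens if rng() < keep_probability(freqs[w], t)]
```

A very frequent word like "the" is kept only a fraction of the time, while rare words always survive, akin to removing stop-words.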
Deleting Rare Words (del)
Delete rare words before creating context windows.
Association Metric Hyperparameters
- Shifted PMI (neg)
- Context distribution smoothing (cds)
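Both association-metric tweaks come from SGNS: shifting PMI by log(neg) mirrors SGNS's negative sampling, and context distribution smoothing raises context counts to a power (0.75 in word2vec) before normalizing, which damps PMI's bias towards infrequent events. A combined sketch (illustrative, not the paper's code):

```python
import numpy as np

def sppmi_matrix(counts, neg=5, cds=0.75):
    """Shifted PPMI with context distribution smoothing:
    SPPMI(w,c) = max(PMI_cds(w,c) - log(neg), 0),
    where the context distribution is smoothed by exponent cds."""
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total   # P(w)
    cc = counts.sum(axis=0) ** cds                   # #(c)^cds
    pc = (cc / cc.sum())[np.newaxis, :]              # smoothed P(c)
    pwc = counts / total                             # P(w,c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(pwc / (pw * pc))
    pmi[np.isnan(pmi)] = -np.inf                     # treat 0/0 as unobserved
    return np.maximum(pmi - np.log(neg), 0.0)        # shift, then clip at 0
```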
Post-processing hyperparameters
- Adding context vectors (w+c)
- Eigenvalue weighting (eig)
- Vector Normalization: the standard L2 normalization of $W$’s rows is consistently superior.
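The w+c variant and row normalization are simple matrix operations; a minimal sketch combining them (function name and guard against zero rows are mine):

```python
import numpy as np

def postprocess(W, C, add_context=True):
    """Optionally add the context matrix C to the word matrix W
    (the w+c variant), then L2-normalize each row of the result."""
    M = W + C if add_context else W
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing a zero row by zero
    return M / norms
```

After normalization, the dot product of any two rows is their cosine similarity, which is why the L2 step matters for similarity tasks.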
Experiments
Word Similarity
Six datasets:
- WordSim-353: divided into two datasets, WordSim Similarity and WordSim Relatedness
- MEN dataset
- Mechanical Turk dataset
- Rare Words dataset
- SimLex-999 dataset
Analogy
- MSR’s analogy dataset
- Google’s analogy dataset
Results
At times, changing hyperparameters can bring a bigger improvement than switching to a different representation method. In some tasks, careful hyperparameter tuning can even outweigh the benefit of adding more data.
SVD is very useful. word2vec outperforms GloVe.
The prediction-based word embeddings are not superior to count-based approaches. The contradictory results in Baroni et al. (2014) stem from creating word2vec embeddings with somewhat pre-tuned hyperparameters (recommended by word2vec) and comparing them to “vanilla” PPMI and SVD representations.
3CosMul dominates 3CosAdd in every case.
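The two analogy objectives for answering "a is to a* as b is to b*" can be written side by side. This sketch assumes all vectors are L2-normalized so dot products are cosines, and shifts cosines into [0, 1] for the multiplicative form; variable names are mine:

```python
import numpy as np

def analogy_scores(a, a_star, b, V, eps=1e-3):
    """Score every row of V as a candidate b* for a : a* :: b : b*.
    Rows of V and the query vectors are assumed L2-normalized."""
    cos_a, cos_astar, cos_b = V @ a, V @ a_star, V @ b
    add = cos_astar - cos_a + cos_b                     # 3CosAdd
    # 3CosMul: map cosines to [0, 1], then multiply the "wanted" terms
    # and divide by the "unwanted" one (eps avoids division by zero)
    mul = ((cos_astar + 1) / 2) * ((cos_b + 1) / 2) / ((cos_a + 1) / 2 + eps)
    return add, mul
```

In practice the query words a, a*, and b are excluded from the candidate set before taking the argmax.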
A few works show that CBOW has a slight advantage over the other methods, but the word2vec paper reports that SGNS performs better.
Hyperparameter Analysis
Harmful Configurations
- SVD does not benefit from shifted PPMI (neg>1)
- Using SVD “correctly” (eig=1) hurts performance
Beneficial Configurations
- PPMI and SVD prefer shorter context windows (win=2).
- SGNS always prefers numerous negative samples (neg>1).
- The only hyperparameter that can be “blindly” applied in any situation is context distribution smoothing (cds=0.75).
Practical Recommendations
- Always use context distribution smoothing to modify PMI.
- Do not use SVD “correctly” (eig=1).
- SGNS is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption.
- With SGNS, prefer many negative samples.
- For both SGNS and GloVe, it is worthwhile to experiment with the $w+c$ variant.